Introduction:

In the realm of natural language processing, understanding the emotional context of text is a fascinating challenge. In this blog post, we embark on a journey to analyze emotions in textual data using Python. Our tools of choice? An emotion dataset paired with LIME for model interpretation.

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# import gensim.utils as g_utils
import re
from nltk.tokenize.treebank import TreebankWordDetokenizer
import gensim.utils as gutils

Loading the Emotion Dataset:

To kick things off, we begin by loading our emotion dataset.

# Loading the Emotion Dataset
df = pd.read_csv('/kaggle/input/emotion-dataset/emotion-dataset.csv')
df.head()
text emotion
0 i didnt feel humiliated sadness
1 i can go from feeling so hopeless to so damned... sadness
2 im grabbing a minute to post i feel greedy wrong anger
3 i am ever feeling nostalgic about the fireplac... love
4 i am feeling grouchy anger
# Displaying Descriptive Statistics
df.describe()
text emotion
count 18000 18000
unique 17958 6
top i was so stubborn and that it took you getting... joy
freq 2 6057

Text Preprocessing: Streamlining Our Data for Analysis

In the intricate world of emotion analysis, the quality of textual data plays a pivotal role in model performance. To enhance the accuracy of our predictions, we employ a robust text preprocessing function.

def preprocess(data):
    # Remove URLs with a regular expression
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    data = url_pattern.sub('', data)

    # Remove email addresses
    data = re.sub(r'\S*@\S*\s?', '', data)

    # Collapse newlines and repeated whitespace into single spaces
    data = re.sub(r'\s+', ' ', data)

    # Remove distracting single quotes
    data = re.sub(r"'", "", data)

    return data

This preprocess function acts as a text janitor, systematically cleaning our data by removing URLs, emails, unnecessary whitespace, and distracting single quotes.
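A quick sanity check on an invented string (the example text is ours, not from the dataset; the function is redefined here so the snippet runs standalone) shows each cleaning step in action:

```python
import re

def preprocess(data):
    # Same cleaning steps as above: URLs, emails, whitespace, single quotes
    data = re.compile(r'https?://\S+|www\.\S+').sub('', data)
    data = re.sub(r'\S*@\S*\s?', '', data)
    data = re.sub(r'\s+', ' ', data)
    data = re.sub(r"'", '', data)
    return data

sample = "Check https://example.com or email me@x.com  it's fine"
print(preprocess(sample))  # Check or email its fine
```

The URL and email vanish, the double spaces collapse, and the apostrophe in "it's" is stripped.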

Text Transformation: From Sentences to Words and Back

The sent_2_words function acts as our linguistic alchemist, breaking down sentences into a stream of tokenized words, ready for analysis. On the flip side, the de_tokenize function smoothly reconstructs these processed words into coherent sentences, ensuring the integrity of our text data throughout the analysis pipeline.

def sent_2_words(sentences):
    """
    Tokenizes and preprocesses a list of sentences using Gensim's simple_preprocess.
    
    Parameters:
    - sentences: List of sentences to be tokenized.
    
    Returns:
    - A generator object yielding lists of words for each sentence.
    """
    for sentence in sentences:
        yield gutils.simple_preprocess(str(sentence), deacc=True)

def de_tokenize(text):
    """
    Detokenizes a list of words using TreebankWordDetokenizer.
    
    Parameters:
    - text: List of words to be detokenized.
    
    Returns:
    - Detokenized sentence.
    """
    return TreebankWordDetokenizer().detokenize(text)
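To build intuition for what `simple_preprocess` produces without pulling in gensim, here is a rough stdlib approximation of its defaults (lowercase, keep alphabetic tokens of 2 to 15 characters); this is an illustrative stand-in, not gensim's actual implementation:

```python
import re

def approx_simple_preprocess(sentence, min_len=2, max_len=15):
    # Rough stand-in for gensim's simple_preprocess:
    # lowercase, then keep alphabetic runs of min_len..max_len characters
    return [t for t in re.findall(r'[a-z]+', sentence.lower())
            if min_len <= len(t) <= max_len]

tokens = approx_simple_preprocess("I didn't feel humiliated!")
print(tokens)            # ['didn', 'feel', 'humiliated']
print(' '.join(tokens))  # naive detokenization; the Treebank detokenizer also fixes punctuation spacing
```

Note how "I" and the orphaned "t" from "didn't" are dropped by the length filter, which matches the output we see later in the notebook ("didnt feel humiliated").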

Label Encoding for Emotions:

In this code snippet, the LabelEncoder from scikit-learn is employed to encode the categorical labels representing emotions in the DataFrame. The fit_transform method is applied to the 'emotion' column, creating a new column named 'emotion_label' with numerical representations of the corresponding emotions. This transformation is crucial for training machine learning models that require numerical input.

from sklearn.preprocessing import LabelEncoder

# Performing Label Encoding for Emotions
df['emotion_label'] = LabelEncoder().fit_transform(df['emotion'])
df.head()
text emotion emotion_label
0 i didnt feel humiliated sadness 4
1 i can go from feeling so hopeless to so damned... sadness 4
2 im grabbing a minute to post i feel greedy wrong anger 0
3 i am ever feeling nostalgic about the fireplac... love 3
4 i am feeling grouchy anger 0
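LabelEncoder assigns integer codes in sorted (alphabetical) class order, which is why 'sadness' maps to 4 and 'anger' to 0 in the output above. A minimal pure-Python equivalent of the mapping it builds:

```python
labels = ['sadness', 'anger', 'love', 'surprise', 'fear', 'joy']

# LabelEncoder assigns codes in sorted (alphabetical) class order
classes = sorted(labels)
mapping = {label: i for i, label in enumerate(classes)}
print(mapping)
# {'anger': 0, 'fear': 1, 'joy': 2, 'love': 3, 'sadness': 4, 'surprise': 5}
```

Keeping this ordering in mind matters later, because scikit-learn metrics that return per-class arrays follow the same sorted order.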

Visualizing Emotion Distribution:

In this code snippet, the distribution of emotions within the DataFrame is visualized using a bar plot. The value_counts() method is applied to the 'emotion' column to obtain the frequency of each unique emotion, and the resulting counts are plotted using the plot function with the 'bar' kind. This visualization provides a quick overview of the distribution of emotions in the dataset.

df['emotion'].value_counts().plot(kind='bar')
<Axes: xlabel='emotion'>

Removing Duplicate Rows:

In this code snippet, rows whose 'text' value duplicates an earlier row are identified with the duplicated() method, and their indices are collected in the index variable. The drop method then removes those rows in place, and reset_index reindexes the DataFrame afterwards.

df[df['text'].duplicated() == True]
text emotion emotion_label
5067 i feel on the verge of tears from weariness i ... joy 2
6133 i still feel a craving for sweet food love 3
6563 i tend to stop breathing when i m feeling stre... anger 0
7623 i was intensely conscious of how much cash i h... sadness 4
7685 im still not sure why reilly feels the need to... surprise 5
8246 i am not amazing or great at photography but i... love 3
9596 ive also made it with both sugar measurements ... joy 2
9687 i had to choose the sleek and smoother feel of... joy 2
9769 i often find myself feeling assaulted by a mul... sadness 4
9786 i feel im being generous with that statement joy 2
10117 i feel pretty tortured because i work a job an... fear 1
10581 i feel most passionate about joy 2
11273 i was so stubborn and that it took you getting... joy 2
11354 i write these words i feel sweet baby kicks fr... love 3
11525 i feel a remembrance of the strange by justin ... fear 1
11823 i have chose for myself that makes me feel ama... joy 2
12441 i still feel completely accepted love 3
12562 i feel so weird about it surprise 5
12892 i cant escape the tears of sadness and just tr... joy 2
13236 i feel like a tortured artist when i talk to her anger 0
13879 i feel like i am very passionate about youtube... love 3
14106 i feel kind of strange surprise 5
14313 i could feel myself hit this strange foggy wall surprise 5
14633 i feel pretty weird blogging about deodorant b... fear 1
14925 i resorted to yesterday the post peak day of i... fear 1
15314 i will feel as though i am accepted by as well... joy 2
15328 i shy away from songs that talk about how i fe... joy 2
15571 i bet taylor swift basks in the knowledge that... anger 0
15704 i began to feel accepted by gaia on her own terms joy 2
15875 i was sitting in the corner stewing in my own ... anger 0
16261 i realized what i am passionate about helping ... joy 2
16264 i feel so blessed and honored that we get to b... love 3
16352 i could feel his breath on me and smell the sw... joy 2
16414 i loved the feeling i got during an amazing sl... joy 2
16501 i am feeling stressed and more than a bit anxious anger 0
16585 i found myself feeling inhibited and shushing ... sadness 4
16916 i feel the need to pimp this since raini my be... joy 2
16958 i feel cared for and accepted love 3
17025 i have not conducted a survey but it is quite ... sadness 4
17274 i feel so weird and scattered with all wonders... surprise 5
17886 i feel like some of you have pains and you can... joy 2
# Removing Duplicate Rows from the DataFrame
index = df[df['text'].duplicated() == True].index
df.drop(index, axis = 0, inplace = True)
df.reset_index(inplace=True, drop = True)
df[df['text'].duplicated() == True]
text emotion emotion_label
len(df)
17958

Training and Testing: The Divide for Model Mastery

As we move into model development, a critical step is splitting the dataset into training and testing sets. The 'text' column forms the feature arrays (X_train, X_test), and the 'emotion' column forms the corresponding label arrays (y_train, y_test). The first 15,000 rows are earmarked for training, and the remaining 2,958 become the test set. This segregation ensures the model is trained on one subset of the data and evaluated on unseen samples, gauging its generalization capabilities. Note that this is a positional split rather than a shuffled one.

# Splitting the Dataset into Train and Test Sets
X_train = np.array(df['text'].values.tolist()[:15000])
X_test = np.array(df['text'].values.tolist()[15000:])
y_train = np.array(df['emotion'].values.tolist()[:15000])
y_test = np.array(df['emotion'].values.tolist()[15000:])
len(X_train) == len(y_train)
True
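Because the slicing above preserves the original row order, a shuffled, stratified split is a common alternative worth knowing. A sketch on a tiny invented corpus (the `texts`/`labels` data and `Xtr`/`Xte` names are ours):

```python
from sklearn.model_selection import train_test_split

texts = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
labels = ['joy', 'joy', 'anger', 'anger', 'joy', 'anger', 'joy', 'anger']

# stratify=labels keeps class proportions similar in both splits
Xtr, Xte, ytr, yte = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels)
print(len(Xtr), len(Xte))  # 6 2
```

With balanced classes and a 25% test fraction, each class contributes one sample to the test set here.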
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, f1_score
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

Model Training: Bridging Algorithms and Text Data

In the ever-evolving landscape of machine learning, training a model becomes an art form, especially when dealing with textual data. The train_model function creates and trains a text classification model using scikit-learn's Pipeline. It takes a model, data, and targets as input, where the model is a machine learning classifier, data represents the input features (text data), and targets are the corresponding labels. The function utilizes a TfidfVectorizer for text feature extraction and incorporates the given classifier into a pipeline.

def train_model(model, data, targets):
    """
    Trains a text classification model using scikit-learn's Pipeline.

    Parameters:
    - model: The machine learning classifier to be trained.
    - data: Input features (text data).
    - targets: Corresponding labels.

    Returns:
    - A trained Pipeline object containing TfidfVectorizer and the given model.
    """
    # Create a Pipeline object with a TfidfVectorizer and the given model
    text_clf = Pipeline([('vect', TfidfVectorizer()),
                         ('clf', model)])
    # Fit the model on the data and targets
    text_clf.fit(data, targets)
    return text_clf
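A quick usage sketch of train_model on a tiny invented corpus (the function is redefined here so the snippet stands alone; the four training texts are made up for illustration):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

def train_model(model, data, targets):
    # TF-IDF features feeding the given classifier, as above
    text_clf = Pipeline([('vect', TfidfVectorizer()),
                         ('clf', model)])
    text_clf.fit(data, targets)
    return text_clf

texts = ['i feel so happy today', 'i am very happy',
         'i feel angry and mad', 'so mad right now']
labels = ['joy', 'joy', 'anger', 'anger']

clf = train_model(LogisticRegression(), texts, labels)
print(clf.predict(['i am happy']))
```

Because the pipeline bundles vectorizer and classifier, raw strings go straight into predict with no separate transform step.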

Logistic Regression Model Evaluation:

In this code snippet, a Logistic Regression model is trained using the train_model function on the training data (X_train, y_train). Subsequently, the trained model is tested on the test data (X_test), and predictions are made. The accuracy of the model is calculated using the accuracy_score function. Additionally, the F1 score is computed for each emotion category and presented in a DataFrame.

# Train the model with the training data
log_reg = train_model(LogisticRegression(solver='liblinear', random_state=0), X_train, y_train)

# Test the model with the test data
y_pred = log_reg.predict(X_test)

# Calculate the accuracy
log_reg_accuracy = accuracy_score(y_test, y_pred)
print('Accuracy: ', log_reg_accuracy, '\n')

# Calculate and Display F1 Scores for Each Emotion
# f1_score(average=None) returns scores in sorted label order,
# so we index by the classifier's classes_, not df.emotion.unique()
f1score = f1_score(y_test, y_pred, average=None)
pd.DataFrame(f1score, index=log_reg.classes_, columns=['F1 Scores'])
Accuracy:  0.8492224475997295 

F1 Scores
anger 0.831309
fear 0.803709
joy 0.875170
love 0.688442
sadness 0.897249
surprise 0.602740
# Generate a detailed Classification Report
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

       anger       0.92      0.76      0.83       405
        fear       0.89      0.73      0.80       355
         joy       0.80      0.96      0.88      1002
        love       0.89      0.56      0.69       244
     sadness       0.86      0.93      0.90       857
    surprise       0.86      0.46      0.60        95

    accuracy                           0.85      2958
   macro avg       0.87      0.74      0.78      2958
weighted avg       0.86      0.85      0.84      2958
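The macro and weighted averages in the report are simple functions of the per-class scores; checking them by hand with the (rounded) F1 values and supports from the table above:

```python
f1 = {'anger': 0.83, 'fear': 0.80, 'joy': 0.88,
      'love': 0.69, 'sadness': 0.90, 'surprise': 0.60}
support = {'anger': 405, 'fear': 355, 'joy': 1002,
           'love': 244, 'sadness': 857, 'surprise': 95}

# Macro average: unweighted mean over the six classes
macro = sum(f1.values()) / len(f1)

# Weighted average: mean weighted by class support
total = sum(support.values())
weighted = sum(f1[c] * support[c] for c in f1) / total

print(round(macro, 2), round(weighted, 2))  # 0.78 0.84
```

The gap between the two (0.78 vs 0.84) reflects the class imbalance: the weaker classes, 'surprise' and 'love', have small supports, so they drag down the macro average more than the weighted one.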

Decision Tree Model Evaluation:

In this code snippet, a Decision Tree model is trained using the train_model function on the training data (X_train, y_train). Subsequently, the trained model is tested on the test data (X_test), and predictions are made. The accuracy of the model is calculated using the accuracy_score function. Additionally, the F1 score is computed for each emotion category and presented in a DataFrame. Finally, a detailed classification report is generated using the classification_report function, providing insights into precision, recall, F1-score, and support for each emotion category.

# Train the model with the training data
dec_tree = train_model(DecisionTreeClassifier(random_state=0), X_train, y_train)

# Test the model with the test data
y_pred = dec_tree.predict(X_test)

# Calculate the accuracy
DTC_accuracy = accuracy_score(y_test, y_pred)
print('Accuracy: ', DTC_accuracy, '\n')

# Calculate and Display F1 Scores for Each Emotion
# (indexed by classes_ since f1_score returns sorted label order)
f1score = f1_score(y_test, y_pred, average=None)
pd.DataFrame(f1score, index=dec_tree.classes_, columns=['F1 Scores'])

# Generate a detailed Classification Report
print(classification_report(y_test, y_pred))
Accuracy:  0.837052062204192 

              precision    recall  f1-score   support

       anger       0.86      0.86      0.86       405
        fear       0.83      0.82      0.82       355
         joy       0.84      0.87      0.86      1002
        love       0.74      0.73      0.73       244
     sadness       0.88      0.84      0.86       857
    surprise       0.62      0.69      0.66        95

    accuracy                           0.84      2958
   macro avg       0.80      0.80      0.80      2958
weighted avg       0.84      0.84      0.84      2958

Support Vector Machine Evaluation:

# Train the model with the training data
SVM = train_model(SVC(random_state=0), X_train, y_train)

# Test the model with the test data
y_pred = SVM.predict(X_test)

# Calculate the accuracy
SVM_accuracy = accuracy_score(y_test, y_pred)
print('Accuracy: ', SVM_accuracy, '\n')

# Calculate and Display F1 Scores for Each Emotion
# (indexed by classes_ since f1_score returns sorted label order)
f1score = f1_score(y_test, y_pred, average=None)
pd.DataFrame(f1score, index=SVM.classes_, columns=['F1 Scores'])
Accuracy:  0.8522650439486139 

F1 Scores
anger 0.830853
fear 0.811550
joy 0.873821
love 0.682292
sadness 0.902786
surprise 0.657718
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

       anger       0.92      0.76      0.83       405
        fear       0.88      0.75      0.81       355
         joy       0.79      0.97      0.87      1002
        love       0.94      0.54      0.68       244
     sadness       0.88      0.93      0.90       857
    surprise       0.91      0.52      0.66        95

    accuracy                           0.85      2958
   macro avg       0.89      0.74      0.79      2958
weighted avg       0.86      0.85      0.85      2958

models = pd.DataFrame({
    'Model': ['Logistic Regression', 'Decision Tree','Support Vector Machine'],
    'Accuracy': [log_reg_accuracy.round(2), DTC_accuracy.round(2), SVM_accuracy.round(2)]})

models.sort_values(by='Accuracy', ascending=False).reset_index().drop(['index'], axis=1)
Model Accuracy
0 Logistic Regression 0.85
1 Support Vector Machine 0.85
2 Decision Tree 0.84

Unveiling Model Predictions: A Lime Exploration

In the intricate realm of machine learning, understanding the decisions made by models is as crucial as the predictions themselves. Today, we delve into the world of Lime—Local Interpretable Model-agnostic Explanations—a powerful tool that sheds light on the inner workings of our predictive models.

Why Lime? Lime comes into play when we deal with complex models where understanding the decision process might be challenging. It acts as a bridge between the black-box nature of certain models and our need for interpretability. Lime provides insights into why a model makes specific predictions for individual instances, offering a transparent view into the otherwise opaque model.
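The core idea can be illustrated in a few lines. What follows is a deliberately simplified sketch, not the lime library's actual algorithm: the real explainer generates many random word-masked perturbations and fits a weighted local linear model, whereas here we just measure each word's leave-one-out effect on a toy "black box" scorer (both functions below are ours, invented for illustration):

```python
def toy_joy_score(text):
    # Stand-in "black box" classifier: scores presence of cue words
    words = text.split()
    return 0.8 * ('happy' in words) - 0.5 * ('sad' in words) + 0.1

def word_effects(text, score_fn):
    # Simplified LIME-style attribution: remove one word at a time
    # and record how much the model's score drops
    words = text.split()
    base = score_fn(text)
    effects = {}
    for i, w in enumerate(words):
        perturbed = ' '.join(words[:i] + words[i + 1:])
        effects[w] = base - score_fn(perturbed)
    return effects

effects = word_effects('i am happy not sad', toy_joy_score)
print(max(effects, key=lambda w: abs(effects[w])))  # happy
```

Removing "happy" hurts the joy score the most, so it is flagged as the most influential word, which is exactly the kind of per-instance ranking LIME's highlighted terms convey.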

import seaborn as sns
import nltk
#Lime
from lime import lime_text
from lime.lime_text import LimeTextExplainer
from lime.lime_text import IndexedString,IndexedCharacters
from lime.lime_base import LimeBase
from lime.lime_text import explanation
sns.set(font_scale=1.3)
nltk.download('omw-1.4')
[nltk_data] Downloading package omw-1.4 to /usr/share/nltk_data...
True
df.emotion.unique()
array(['sadness', 'anger', 'love', 'surprise', 'fear', 'joy'],
      dtype=object)
dec_tree.classes_
array(['anger', 'fear', 'joy', 'love', 'sadness', 'surprise'], dtype='<U8')
X_test[15]
'i have this feeling whenever i write a song and if i think that the song has legs enough to be popular or for people to really respond to it i get this feeling'
np.array(df['emotion'].values.tolist()[15015])
array('joy', dtype='<U3')

Lime Explanation:

  1. Key Words: Lime highlights the significance of the terms "heartless" and "bitch" in contributing to the model's prediction. These words carry negative and potentially aggressive connotations, aligning with the predicted emotion of 'anger.'
  2. Negative Sentiment: The words "heartless" and "bitch" convey a negative sentiment, suggesting frustration, discontent, or anger. Lime recognizes the impact of these terms on the overall emotion prediction.
  3. Relative Importance: Lime's explanation may indicate that the strong negative terms have a substantial influence on the prediction of 'anger.' The intensity of emotion expressed in the input text contributes to the model's decision.
  4. Model Sensitivity: Lime demonstrates how the model reacts to variations in the input text, emphasizing the sensitivity to specific terms. Adjusting the text allows Lime to showcase the terms that significantly affect the predicted emotion.

Interpretation:

The model predicts 'anger' for the given custom text, and Lime's explanation highlights the presence of strongly negative terms such as "heartless" and "bitch." These terms contribute to the model's perception of anger or intense negative emotions. Lime provides clarity on the model's decision process, shedding light on the influential terms and their impact on the emotion prediction.

explainer_LR = LimeTextExplainer(class_names=dec_tree.classes_)
idx  = 56
print("Actual Text : ", X_test[idx])
print("Prediction : ", dec_tree.predict(X_test)[idx])
print("Actual :     ", y_test[idx])
exp = explainer_LR.explain_instance(X_test[idx], dec_tree.predict_proba,top_labels=5)
exp.show_in_notebook()
Actual Text :  i just want the best for that boy maybe i can really stop feeling like im a heartless bitch
Prediction :  anger
Actual :      anger

Lime Explanation:

  1. Key Words: Lime highlights the importance of the tokens "sad", "not", and "happy" in contributing to the model's prediction. Note that the text explainer weights individual words, so "not sad" is treated as two separate features rather than a phrase.
  2. Negation Impact: Because "not" and "sad" are scored independently, a bag-of-words model cannot directly represent negation; the 'joy' prediction rests on the combined weights of the individual tokens.
  3. Relative Importance: Lime assigns relative importance scores to the different terms, emphasizing their impact on the model's decision. Here "happy" contributes positively to the 'joy' prediction.
  4. Model Sensitivity: Lime probes the model's sensitivity to variations in the input text. By perturbing the text and observing changes in the prediction, it shows how specific terms influence the outcome.

Interpretation:

The model predicts 'joy' for the custom text even though "sad" appears twice, and Lime's explanation makes the trade-off visible: "happy" pulls the prediction toward 'joy' while "sad" pulls against it. Since the bag-of-words features cannot capture the phrase "not sad", the explanation highlights individual tokens, giving a transparent view of which words drive the predicted emotion.

custom_x = np.array(['i am not sad, but happy, yes not very sad'])
explainer_LR = LimeTextExplainer(class_names=dec_tree.classes_)
print("Actual Text : ", custom_x)
print("Prediction : ", dec_tree.predict(custom_x)[0])
print("Actual :     ", 'joy')
exp = explainer_LR.explain_instance(custom_x[0], dec_tree.predict_proba, top_labels=5)
exp.show_in_notebook()
exp.show_in_notebook()
Actual Text :  ['i am not sad, but happy, yes not very sad']
Prediction :  joy
Actual :      joy
explainer_LR = LimeTextExplainer(class_names=dec_tree.classes_)
idx  = 15
print("Actual Text : ", X_test[idx])
print("Prediction : ", dec_tree.predict(X_test)[idx])
print("Actual :     ", y_test[idx])
exp = explainer_LR.explain_instance(X_test[idx], dec_tree.predict_proba,top_labels=5)
exp.show_in_notebook()
Actual Text :  i have this feeling whenever i write a song and if i think that the song has legs enough to be popular or for people to really respond to it i get this feeling
Prediction :  joy
Actual :      joy
explainer_LR = LimeTextExplainer(class_names=dec_tree.classes_)
idx  = 157
print("Actual Text : ", X_test[idx])
print("Prediction : ", dec_tree.predict(X_test)[idx])
print("Actual :     ", y_test[idx])
exp = explainer_LR.explain_instance(X_test[idx], dec_tree.predict_proba,top_labels=5)
exp.show_in_notebook()
Actual Text :  i feel so vain when i look at myself and notice how much i like my nose or how nice my face structure is
Prediction :  sadness
Actual :      sadness
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Dense, LSTM, Embedding, Bidirectional
temp = []
#Splitting pd.Series to list
data_to_list = df['text'].values.tolist()
for i in range(len(data_to_list)):
    temp.append(preprocess(data_to_list[i]))
data_words = list(sent_2_words(temp))
data = []
for i in range(len(data_words)):
    data.append(de_tokenize(data_words[i]))
print(data[:5])
['didnt feel humiliated', 'can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake', 'im grabbing minute to post feel greedy wrong', 'am ever feeling nostalgic about the fireplace will know that it is still on the property', 'am feeling grouchy']

max_words = 5000
# Note: data holds detokenized strings, so this is the maximum
# *character* length (294), not the maximum token count; the padding
# below is therefore far longer than any sentence's word count
max_len = max([len(t) for t in data])

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(data)
sequences = tokenizer.texts_to_sequences(data)
msgs = pad_sequences(sequences, maxlen=max_len)
print(msgs)
print(msgs)
[[   0    0    0 ...  133    1  637]
 [   0    0    0 ...    2   19 1333]
 [   0    0    0 ...    1  457  407]
 ...
 [   0    0    0 ...    5    7 3267]
 [   0    0    0 ...   46    8 2490]
 [   0    0    0 ...  297    3  297]]
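The zeros at the start of each row come from pad_sequences left-padding by default (padding='pre', truncating='pre'). A minimal pure-Python equivalent (pad_pre is our illustrative helper, not the Keras function):

```python
def pad_pre(sequences, maxlen):
    # Left-pad each sequence with zeros to a fixed length,
    # truncating from the front if it is too long (the Keras default)
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]
        padded.append([0] * (maxlen - len(seq)) + list(seq))
    return padded

print(pad_pre([[5, 7], [1, 2, 3, 4, 5, 6]], maxlen=4))
# [[0, 0, 5, 7], [3, 4, 5, 6]]
```

Left-padding keeps the informative tokens at the end of each row, next to where the recurrent layers finish reading the sequence.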
X_train = np.array(msgs[:15000])
X_test = np.array(msgs[15000:])
y_train = np.array(df['emotion_label'].values.tolist()[:15000])
y_test = np.array(df['emotion_label'].values.tolist()[15000:])
df.head()
text emotion emotion_label
0 i didnt feel humiliated sadness 4
1 i can go from feeling so hopeless to so damned... sadness 4
2 im grabbing a minute to post i feel greedy wrong anger 0
3 i am ever feeling nostalgic about the fireplac... love 3
4 i am feeling grouchy anger 0
vocabSize = len(tokenizer.index_word) + 1
!wget http://nlp.stanford.edu/data/glove.6B.zip
--2024-01-02 15:45:36--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2024-01-02 15:45:36--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2024-01-02 15:45:36--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’

glove.6B.zip        100%[===================>] 822.24M  5.02MB/s    in 2m 39s  

2024-01-02 15:48:15 (5.17 MB/s) - ‘glove.6B.zip’ saved [862182613/862182613]

!unzip glove*.zip
Archive:  glove.6B.zip
  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       

Loading GloVe Embeddings and Creating Embedding Matrix

In this code snippet, GloVe word embeddings are loaded from a specified file (path_to_glove_file). The embeddings are parsed, and a dictionary (embeddings_index) is created, mapping words to their corresponding embedding vectors. Subsequently, an embedding matrix is constructed using this pre-trained GloVe data. The matrix is shaped to match the vocabulary size (num_tokens) and embedding dimension (embedding_dim). Words from the tokenizer's vocabulary are assigned their respective embedding vectors if available; otherwise, the embedding is set to zero.

# Read GloVE embeddings
path_to_glove_file = '/kaggle/working/glove.6B.200d.txt'
num_tokens = vocabSize
embedding_dim = 200 #latent factors or features  
hits = 0
misses = 0
embeddings_index = {}
# Load GloVe word embeddings from the specified file
with open(path_to_glove_file) as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs
print("Found %s word vectors." % len(embeddings_index))

# Initialize an embedding matrix for our neural network
embedding_matrix = np.zeros((num_tokens, embedding_dim))

# Assign pre-trained word vectors to our vocabulary
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # Words found in the embedding index are assigned their respective vectors
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        # Words not found in the embedding index are set to all-zeros
        # This includes the representation for "padding" and "OOV" (Out of Vocabulary)
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))
Found 400000 word vectors.
Converted 30056 words (2158 misses)
X_train.shape
(15000, 294)

Neural Network Architecture: Bidirectional LSTM

In this code snippet, a neural network model is constructed using a Bidirectional Long Short-Term Memory (Bi-LSTM) architecture for emotion analysis. The model is compiled using the sparse categorical cross-entropy loss function and the RMSprop optimizer. The architecture comprises an embedding layer, three Bidirectional LSTM layers with varying dropout rates, and a final dense layer with a softmax activation function for multi-class classification.

# Define the sequential model
model = Sequential()

# Add an embedding layer with pre-trained weights and fixed trainable status
model.add(Embedding(vocabSize, 200, input_length=X_train.shape[1], weights=[embedding_matrix], trainable=False))

# Stack three Bidirectional LSTM layers with varying dropout rates
model.add(Bidirectional(LSTM(256, dropout=0.2, recurrent_dropout=0.2, return_sequences=True)))
model.add(Bidirectional(LSTM(128, dropout=0.3, recurrent_dropout=0.3, return_sequences=True)))
model.add(Bidirectional(LSTM(128, dropout=0.5, recurrent_dropout=0.5)))

# Add a dense layer with softmax activation for multi-class classification
model.add(Dense(6, activation='softmax'))

# Compile the model with sparse categorical cross-entropy loss and RMSprop optimizer
model.compile(loss='sparse_categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

# Display a summary of the model architecture
model.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding (Embedding)       (None, 294, 200)          3221600   
                                                                 
 bidirectional (Bidirectiona  (None, 294, 512)         935936    
 l)                                                              
                                                                 
 bidirectional_1 (Bidirectio  (None, 294, 256)         656384    
 nal)                                                            
                                                                 
 bidirectional_2 (Bidirectio  (None, 256)              394240    
 nal)                                                            
                                                                 
 dense (Dense)               (None, 6)                 1542      
                                                                 
=================================================================
Total params: 5,209,702
Trainable params: 1,988,102
Non-trainable params: 3,221,600
_________________________________________________________________
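The parameter counts in the summary can be verified by hand. An LSTM with n units on d-dimensional input has 4*(n*(d+n) + n) parameters (four gates, each with input weights, recurrent weights, and a bias), and Bidirectional doubles that; the vocabulary size of 16108 is inferred here from the embedding row of the summary (3,221,600 / 200):

```python
def bilstm_params(units, input_dim):
    # One LSTM direction: 4 gates, each with input weights,
    # recurrent weights, and a bias; Bidirectional doubles the total
    one_direction = 4 * (units * (input_dim + units) + units)
    return 2 * one_direction

vocab, emb_dim = 16108, 200
print(vocab * emb_dim)              # 3221600  embedding
print(bilstm_params(256, emb_dim))  # 935936   first Bi-LSTM
print(bilstm_params(128, 2 * 256))  # 656384   second (512-dim input)
print(bilstm_params(128, 2 * 128))  # 394240   third (256-dim input)
print(2 * 128 * 6 + 6)              # 1542     final dense layer
```

Each Bi-LSTM's input dimension is twice the previous layer's unit count, since the forward and backward outputs are concatenated.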
from keras.utils.vis_utils import plot_model
plot_model(model, show_shapes=True)

#to stop the training when the loss starts to increase
callback = EarlyStopping(
    monitor="val_loss",
    patience=4,
    restore_best_weights=True,
)

Training Model and Achieving 90% Accuracy:

This code snippet encapsulates the training process of our Bidirectional LSTM model. With early stopping monitoring the validation loss, the run shown below reaches a validation accuracy of just over 90%, indicating that the model learns and generalizes well from the emotional text data.

# Fit model
history = model.fit(X_train,
                    y_train,
                    validation_data=(X_test, y_test),
                    verbose=1,
                    batch_size=256,
                    epochs=5,
                    callbacks=[callback]
                   )
Epoch 1/5
59/59 [==============================] - 349s 6s/step - loss: 0.3502 - accuracy: 0.8705 - val_loss: 0.3633 - val_accuracy: 0.8611
Epoch 2/5
59/59 [==============================] - 349s 6s/step - loss: 0.3106 - accuracy: 0.8816 - val_loss: 0.2811 - val_accuracy: 0.8884
Epoch 3/5
59/59 [==============================] - 349s 6s/step - loss: 0.2844 - accuracy: 0.8884 - val_loss: 0.2763 - val_accuracy: 0.8925
Epoch 4/5
59/59 [==============================] - 349s 6s/step - loss: 0.2492 - accuracy: 0.9024 - val_loss: 0.2546 - val_accuracy: 0.9013
Epoch 5/5
59/59 [==============================] - 348s 6s/step - loss: 0.2324 - accuracy: 0.9101 - val_loss: 0.2418 - val_accuracy: 0.9009

Conclusion:

In our expedition through the vast realm of machine learning, our venture into the world of emotion analysis has been a captivating odyssey filled with discovery, experimentation, and innovation. From the foundational steps of preparing our data to the intricate choreography between interpretable models and pre-trained embeddings, each stage has played a vital role in crafting a resilient and insightful emotion analysis system.

The harmony between conventional machine learning models and more advanced architectures, such as Bidirectional LSTMs, has uncovered the potency of contextual comprehension. This revelation allows us to unravel the complexities embedded within textual data, shedding light on the nuanced expressions of emotion. Our exploration into tools like Lime has acted as a guiding lantern, casting clarity on the decision-making mechanisms of these models, fostering transparency and trust in our analytical journey.

The significance of pre-trained word embeddings, especially the GloVe embeddings, cannot be overstated. They have proven to be the lifeblood of our models, infusing words with semantic richness. This infusion enables our neural network to not just understand but truly feel the subtle shades of emotion encoded in text, reaching a validation accuracy of roughly 90%.